WORD SEGMENTATION IN CHINESE TEXT



TECHNICAL FIELD 

The invention relates generally to the field of natural language processing, 
and, more specifically, to the field of word segmentation. 

BACKGROUND OF THE INVENTION

Word segmentation refers to the process of identifying the individual 
words that make up an expression of language, such as text. Word segmentation is useful 
for checking spelling and grammar, synthesizing speech from text, and performing 
natural language parsing and understanding, all of which benefit from an identification of
individual words.

Performing word segmentation of English text is rather straightforward, 
since spaces and punctuation marks generally delimit the individual words in the text. 
Consider the English sentence in Table 1 below. 

The motion was then tabled-that is, removed
indefinitely from consideration.

Table 1 

By identifying each contiguous sequence of spaces and/or punctuation marks as the end
of the word preceding the sequence, the English sentence in Table 1 may be
straightforwardly segmented as shown in Table 2 below.

The motion was then tabled - that is, removed
indefinitely from consideration .

Table 2

In Chinese text, word boundaries are implicit rather than explicit. 
Consider the sentence in Table 3 below, meaning "The committee discussed this problem 
yesterday afternoon in Buenos Aires." 

Table 3

Despite the absence of punctuation and spaces from the sentence, a reader of Chinese
would recognize the sentence in Table 3 as comprising the words separately
underlined in Table 4 below.

Table 4 

It can be seen from the examples above that Chinese word segmentation
cannot be performed in the same manner as English word segmentation. An accurate and
efficient approach to automatically performing Chinese segmentation would nonetheless
have significant utility.

SUMMARY OF THE INVENTION 
The present invention provides a facility for selecting from a sequence of
natural language characters combinations of characters that may be words. The facility
uses probability indications for each of a plurality of words as a function of adjacent
characters.

One aspect of the present invention is a method in a computer system for
identifying individual words occurring in a sentence of text. The method includes the
steps of: for each of a plurality of words, storing an indication of probability of whether
the word occurs in natural language text as a function of adjacent characters; and for each
of a plurality of contiguous groups of characters occurring in the sentence, determining
overlapping possible words, ascertaining probability based on the stored indication and
adjacent characters, and submitting the groups of characters determined to be possible
words to a parser with an indication of probability. A computer readable medium for
storing the instructions implementing the same is also provided.

The second aspect of the present invention includes computer memory containing
a word segmentation data structure for use in identifying individual words occurring in
natural language text. The data structure includes for each of a plurality of words an
indication of probability of whether the word occurs in natural language text as a function
of adjacent characters.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a high-level block diagram of the general-purpose computer
system upon which the facility preferably executes.

Figure 2 is an overview flow diagram showing the two phases in which the
facility preferably operates.

Figure 3 is a flow diagram showing the steps preferably performed by the
facility in order to augment the lexical knowledge base in the initialization phase to
include information used to perform word segmentation.

Figure 4 is a flow diagram showing the steps preferably performed in order
to determine whether a particular word can contain other, smaller words.

Figure 5 is a flow diagram of the steps preferably performed by the facility
in order to segment a sentence into its constituent words.

Figure 6 is a flow diagram showing the steps preferably performed by the
facility in order to add multiple-character words to the word list.

Figure 7 is a flow diagram showing the steps preferably performed by the
facility in order to test the NextChars and CharPos conditions for a word candidate.

Figure 8 is a flow diagram showing the steps preferably performed by the
facility in order to determine whether the last character of the current word candidate
overlaps with another word candidate that may be a word.

Figure 9 is a flow diagram showing the steps preferably performed by the
facility in order to add single-character words to the word list.

Figure 10 is a flow diagram showing the steps preferably performed by the
facility in order to assign probabilities to the lexical records generated from the words in
the word list in accordance with a first approach.

Figure 11 is a flow diagram showing the steps preferably performed by the
facility in order to assign probabilities to the lexical records generated from the words in
the word list in accordance with a second approach.

Figure 12 is a parse tree diagram showing a parse tree generated by the
parser representing the syntactic structure of the sample sentence.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides word segmentation in Chinese text. In a
preferred embodiment, a word segmentation software facility ("the facility") provides
word segmentation for text in unsegmented languages such as Chinese by (1) evaluating
the possible combinations of characters in an input sentence and discarding those unlikely
to represent words in the input sentence, (2) looking up the remaining combinations of
characters in a dictionary to determine whether they may constitute words, and (3)
submitting the combinations of characters determined to be words to a natural language
parser as alternative lexical records representing the input sentence. The parser generates
a syntactic parse tree representing the syntactic structure of the input sentence, which
contains only those lexical records representing the combinations of characters certified
to be words in the input sentence. When submitting the lexical records to the parser, the
facility weights the lexical records so that longer combinations of characters, which more
commonly represent the correct segmentation of a sentence than shorter combinations of
characters, are considered by the parser before shorter combinations of characters.

In order to facilitate discarding combinations of characters unlikely to
represent words in the input sentence, the facility adds to the dictionary, for each
character occurring in the dictionary, (1) indications of all of the different combinations
of word length and character position in which the character appears, and (2) indications of
all of the characters that may follow this character when this character begins a word.
The facility further adds (3) indications to multiple-character words of whether sub-words
within the multiple-character words are viable and should be considered. In processing a
sentence, the facility discards (1) combinations of characters in which any character is
used in a word length/position combination not occurring in the dictionary, and
(2) combinations of characters in which the second character is not listed as a possible
second character of the first character. The facility further discards (3) combinations of
characters occurring in a word for which sub-words are not to be considered.

In this manner, the facility both minimizes the number of character
combinations looked up in the dictionary and utilizes the syntactic context of the sentence
to differentiate between alternative segmentation results that are each comprised of valid
words.

Figure 1 is a high-level block diagram of the general-purpose computer
system upon which the facility preferably executes. The computer system 100 contains a
central processing unit (CPU) 110, input/output devices 120, and a computer memory
(memory) 130. Among the input/output devices are a storage device 121, such as a
hard-disk drive; a computer-readable media drive 122, which can be used to install software
products, including the facility, which are provided on a computer-readable medium, such
as a CD-ROM; and a network connection 123, through which the computer system 100
may communicate with other connected computer systems (not shown). The memory 130
preferably contains a word segmentation facility 131 for identifying individual words
occurring in Chinese text, a syntactic parser 133 for generating a parse tree representing
the syntactic structure of a sentence of natural language text from lexical records
representing the words occurring in the natural language text, and a lexical knowledge
base 132 for use by the parser in constructing lexical records for a parse tree and for use
by the facility to identify words occurring in natural language text. While the facility is
preferably implemented on a computer system configured as described above, those
skilled in the art will recognize that it may also be implemented on computer systems
having different configurations.

Figure 2 is an overview flow diagram showing the two phases in which the
facility preferably operates. In step 201, as part of an initialization phase, the facility
augments a lexical knowledge base to include information used by the facility to perform
word segmentation. Step 201 is discussed in greater detail below in conjunction with
Figure 3. Briefly, in step 201, the facility adds entries to the lexical knowledge base for
the characters occurring in any word in the lexical knowledge base. The entry added for
each character includes a CharPos attribute that indicates the different positions at which
the character appears in words. The entry for each character further contains a NextChars
attribute that indicates the set of characters that occur in the second position of words that
begin with the current character. Finally, the facility also adds an IgnoreParts attribute to
each word occurring in the lexical knowledge base that indicates whether the sequence of
characters comprising the word should ever be considered to comprise smaller words that
together make up the current word.

After step 201, the facility continues in step 202, ending the initialization
phase and beginning the word segmentation phase. In the word segmentation phase, the
facility uses the information added to the lexical knowledge base to perform word
segmentation of sentences of Chinese text. In step 202, the facility receives a sentence of
Chinese text for word segmentation. In step 203, the facility segments the received
sentence into its constituent words. Step 203 is discussed in greater detail below in
conjunction with Figure 5. Briefly, the facility looks up in the lexical knowledge base a
small fraction of all the possible contiguous combinations of characters in the sentence.
The facility then submits to a syntactic parser the looked-up combinations of characters
that are indicated to be words by the lexical knowledge base. The parser, in determining
the syntactic structure of the sentence, identifies the combinations of characters intended
to comprise words in the sentence by its author. After step 203, the facility continues at
step 202 to receive the next sentence for word segmentation.

Figure 3 is a flow diagram showing the steps preferably performed by the
facility in order to augment the lexical knowledge base in the initialization phase to
include information used to perform word segmentation. These steps (a) add entries to
the lexical knowledge base for the characters occurring in words in the lexical knowledge
base; (b) add CharPos and NextChars attributes to the character entries in the lexical
knowledge base; and (c) add the IgnoreParts attribute to the entries for words in the lexical
knowledge base.

In steps 301-312, the facility loops through each word entry in the lexical
knowledge base. In step 302, the facility loops through each character position in the
word. That is, for a word containing three characters, the facility loops through the first,
second, and third characters of the word. In step 303, if the character in the current
character position has an entry in the lexical knowledge base, then the facility continues
in step 305, else the facility continues in step 304. In step 304, the facility adds an entry
to the lexical knowledge base for the current character. After step 304, the facility
continues in step 305. In step 305, the facility adds an ordered pair to the CharPos
attribute stored in the character's entry in the lexical knowledge base to indicate that the
character may occur in the position in which it occurs in the current word. The ordered
pair added has the form (position, length), where position is the position that the character
occupies in the word and length is the number of characters in the word. For example,
for the character "[…]" in the word "[…]," the facility will add the ordered pair (1, 3)
to the list of ordered pairs stored in the CharPos attribute in the lexical knowledge base
entry for the character "[…]." The facility preferably does not add the ordered pair as
described in step 305 if the ordered pair is already contained in the CharPos attribute for
the current character. In step 306, if additional characters remain in the current word to be
processed, then the facility continues in step 302 to process the next character, else the
facility continues in step 307.

In step 307, if the word is a single-character word, then the facility
continues in step 309, else the facility continues in step 308. In step 308, the facility adds
the character in the second position of the current word to the list of characters in the
NextChars attribute in the lexical knowledge base record for the character in the first
position of the current word. For example, for the word "[…]," the facility adds the
character "[…]" to the list of characters stored for the NextChars attribute of the character
"[…]." After step 308, the facility continues in step 309.

In step 309, if the current word can contain other, smaller words, then the
facility continues in step 311, else the facility continues in step 310. Step 309 is
discussed in further detail below in conjunction with Figure 4. Briefly, the facility
employs a number of heuristics to determine whether an occurrence of the sequence of
characters that make up the current word may in some context make up two or more
smaller words.

In step 310, the facility sets an IgnoreParts attribute for the word in the
lexical knowledge base entry for the word. Setting the IgnoreParts attribute indicates
that, when the facility encounters this word in a sentence of input text, it should not
perform further steps to determine whether this word contains smaller words. After step
310, the facility continues in step 312. In step 311, because the current word can contain
other words, the facility clears the IgnoreParts attribute for the word, so that the facility,
when it encounters the word in a sentence of input text, proceeds to investigate whether
the word contains smaller words. After step 311, the facility continues in step 312. In
step 312, if additional words remain in the lexical knowledge base to be processed, then
the facility continues in step 301 to process the next word, else these steps conclude.
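
The character-entry portion of the steps of Figure 3 (steps 302-308) can be sketched as follows. This is a minimal illustration only, not the facility's implementation: the function name build_character_entries, the use of Python sets for the attribute lists, and the English-letter words standing in for Chinese words are all assumptions made for the example.

```python
from collections import defaultdict


def build_character_entries(words):
    """Derive CharPos and NextChars tables from an iterable of dictionary words.

    Hypothetical helper: returns two dicts keyed by character.
    """
    char_pos = defaultdict(set)    # character -> {(position, length), ...}
    next_chars = defaultdict(set)  # first character -> {second characters}
    for word in words:
        length = len(word)
        # Steps 302-306: record every (position, length) pair; a set
        # silently skips pairs that are already present.
        for position, ch in enumerate(word, start=1):
            char_pos[ch].add((position, length))
        # Steps 307-308: only multiple-character words contribute to
        # the NextChars list of their first character.
        if length > 1:
            next_chars[word[0]].add(word[1])
    return dict(char_pos), dict(next_chars)


# Toy lexicon using English letters in place of Chinese characters.
char_pos, next_chars = build_character_entries(["be", "at", "beat", "bat"])
```

In this toy lexicon the letter "b" begins words of length 2, 3, and 4, so its CharPos entry holds (1, 2), (1, 3), and (1, 4), and its NextChars entry holds "e" and "a".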

When the facility performs the steps shown in Figure 3 to augment the
lexical knowledge base by assigning CharPos and NextChars attributes to each character,
it assigns these attributes to the characters occurring in the sample sentence shown in
Table 3 as shown below in Table 5.



Character  CharPos                                                NextChars
---------  -----------------------------------------------------  ---------
[…]        (1,2) (1,3) (3,4)                                      […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
[…]        (1,2) (2,2) (2,3) (2,4)                                […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (3,4) (4,4) (3,5)  […]
[…]        (1,2) (2,2) (2,3) (3,3) (2,4) (3,4) (4,4)              […]
[…]        […] (4,4)                                              […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
           (1,5) (2,5) (3,5) (4,5) (1,6) (2,6) (1,7)
[…]        (1,2) (2,2) (2,3) (3,3) (2,4) (3,4) (4,4) (3,6) (2,7)  […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (2,4) (3,4) (4,4) (3,7)  […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
           (1,5) (2,5) (3,5) (4,5) (5,5) (1,6) (3,6) (4,6) (5,6)
           (6,6) (4,7) (5,7) (6,7) (7,7)
[…]        (1,2) (2,2) (1,3) (3,4) (4,4) (1,5) (5,7)              […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
           (2,5) (3,5) (4,5) (5,6) (6,7)
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4) (3,4) (4,4)  […]
           (1,5) (2,5) (3,5) (4,5) (5,5) (1,6) (3,6) (4,6) (5,6)
           (6,6) (4,7) (5,7) (6,7) (7,7)
[…]        (1,2) (1,3) (1,4)                                      […]
[…]        (1,2) (2,2) (1,3) (2,3) (3,3) (1,4) (2,4)              […]
[…]        (1,2) (2,2) (1,3) (2,3) (1,4) (3,4) (4,4)              […]
[…]        (1,2) (2,2) (2,3) (3,3) (2,4) (4,4)                    […]

Table 5: Character Lexical Knowledge Base Entries



It can be seen from Table 5, for instance, from the CharPos attribute of the character
"[…]" that this character can appear as the first character of words that are 2, 3, or 4
characters long. It can further be seen from the NextChars attribute of the character "[…]"
that, in words beginning with this character, the second character may be either "[…],"
"[…]," or "[…]."

Figure 4 is a flow diagram showing the steps preferably performed in order
to determine whether a particular word can contain other, smaller words. As an analogy
to English, if spaces and punctuation characters were removed from an English sentence,
the sequence of characters "beat" could be interpreted either as the word "beat" or as the
two words "be" and "at." In step 401, if the word contains four or more characters, then
the facility continues in step 402 to return the result that the word cannot contain other
words, else the facility continues in step 403. In step 403, if all the characters in the word
can constitute single-character words, then the facility continues in step 405, else the
facility continues in step 404 to return the result that the word cannot contain other words.
In step 405, if the word contains a word frequently used as a derivational affix, that is, a
prefix or a suffix, then the facility continues in step 406 to return the result that the word
cannot contain other words, else the facility continues in step 407. In step 407, if an
adjacent pair of characters in the word are often divided when they appear adjacently in
text of the language, then the facility continues in step 409 to return the result that the
word can contain other words, else the facility continues in step 408 to return the result
that the word cannot contain other words.
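
The decision sequence of Figure 4 can be sketched as a chain of early returns. The sketch below is illustrative only: the function name and the three lookup collections (single_char_words, affixes, often_divided_pairs) are hypothetical stand-ins for linguistic data whose source the description does not specify.

```python
def can_contain_smaller_words(word, single_char_words, affixes,
                              often_divided_pairs):
    """Return True if the word may be made up of smaller words.

    All three lookup collections are assumed inputs for illustration.
    """
    # Step 401: words of four or more characters are never split.
    if len(word) >= 4:
        return False
    # Step 403: every character must be usable as a single-character word.
    if not all(ch in single_char_words for ch in word):
        return False
    # Step 405: a contained derivational affix rules out splitting.
    if any(affix in word for affix in affixes):
        return False
    # Step 407: some adjacent pair must be one that is often divided.
    return any((a, b) in often_divided_pairs for a, b in zip(word, word[1:]))


# Toy data, with English letters standing in for Chinese characters.
singles = set("abc")
result_split = can_contain_smaller_words("abc", singles, {"x"}, {("a", "b")})
result_long = can_contain_smaller_words("abca", singles, {"x"}, {("a", "b")})
result_affix = can_contain_smaller_words("abc", singles, {"c"}, {("a", "b")})
result_no_pair = can_contain_smaller_words("abc", singles, {"x"}, set())
```

Under this reading, the IgnoreParts attribute of a word would be set exactly when the function returns False and cleared when it returns True.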

The results of determining whether particular words can contain other,
smaller words are shown below in Table 6.

Word  IgnoreParts
----  -----------
[…]   set
[…]   clear
[…]   set
[…]   clear
[…]   set
[…]   set
[…]   set
[…]   clear
[…]   set

Table 6: Word Lexical Knowledge Base Entries

For example, it can be seen from Table 6 that the facility has determined that the word
"[…]" cannot contain other words, while the word "[…]" may contain other words.

Figure 5 is a flow diagram of the steps preferably performed by the facility
in order to segment a sentence into its constituent words. These steps generate a word list
identifying different words of the language that occur in the sentence. The word list is
then submitted to a parser to identify the subset of words in the word list that were
intended to comprise the sentence by its author.

In step 501, the facility adds to the word list multiple-character words
occurring in the sentence. Step 501 is discussed in greater detail below in conjunction
with Figure 6. In step 502, the facility adds to the word list the single-character words
occurring in the sentence. Step 502 is discussed in greater detail below in conjunction
with Figure 9. In step 503, the facility generates lexical records used by the lexical parser
for the words that have been added to the word list in steps 501 and 502. In step 504, the
facility assigns probabilities to the lexical records. The probability of a lexical record
reflects the likelihood that the lexical record will be part of a correct parse tree for the
sentence, and is used by the parser to order the application of the lexical records in the
parsing process. The parser applies the lexical records during the parsing process in
decreasing order of their probabilities. Step 504 is discussed in greater detail below in
conjunction with Figure 10. In step 505, the facility utilizes the syntactic parser to parse
the lexical records in order to produce a parse tree reflecting the syntactic structure of the
sentence. This parse tree has a subset of the lexical records generated in step 503 as its
leaves. In step 506, the facility identifies as words of the sentence the words represented
by the lexical records that are the leaves of the parse tree. After step 506, these steps
conclude.
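
The ordering described for step 504 can be illustrated with a short sketch. It shows only the ordering itself, not how probabilities are assigned (that is the subject of Figures 10 and 11); the LexicalRecord type, its fields, and the probability values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalRecord:
    """Hypothetical lexical record: a word, its start offset, a probability."""
    word: str
    start: int
    probability: float


def parser_order(records):
    """Return the records in decreasing order of probability.

    sorted() is stable, so records with equal probability keep their
    original relative order.
    """
    return sorted(records, key=lambda r: r.probability, reverse=True)


# Made-up probabilities in which longer words happen to rank higher.
records = [
    LexicalRecord("a", 0, 0.2),
    LexicalRecord("abc", 0, 0.9),
    LexicalRecord("ab", 0, 0.6),
]
ordered_words = [r.word for r in parser_order(records)]
```

With these made-up values the parser would be offered "abc" first, then "ab", then "a".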

Figure 6 is a flow diagram showing the steps preferably performed by the
facility in order to add multiple-character words to the word list. These steps use a current
position within the sentence in analyzing the sentence to identify multiple-character
words. These steps further utilize the CharPos, NextChars, and IgnoreParts attributes
added to the lexical knowledge base by the facility as shown in Figure 3. In accordance
with a first preferred embodiment, the facility retrieves these attributes from a lexical
knowledge base on an as-needed basis during the performance of the steps shown in
Figure 6. In a second preferred embodiment, the values of the NextChars attributes and/or
the CharPos attributes of the characters in the sentence are all pre-loaded before the
performance of the steps shown in Figure 6. In conjunction with the second preferred
embodiment, a 3-dimensional array is preferably stored in the memory that contains the
value of the CharPos attribute for each character occurring in the sentence. This array
indicates, for a character at a given position in the sentence, whether the character may be
at a given position in a word of a given length. Caching the values of these attributes
allows them to be efficiently accessed when performing the steps shown in Figure 6.
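
One way to picture the 3-dimensional array of the second preferred embodiment is sketched below. The nested-list layout, the index order, the MAX_LEN constant, and the function name are assumptions made for illustration; the description states only that the array records, for each sentence position, whether the character may occupy a given position in a word of a given length.

```python
MAX_LEN = 7  # longest word candidate considered in the steps of Figure 6


def build_charpos_cache(sentence, char_pos):
    """Pre-load CharPos values for every character of the sentence.

    cache[i][position][length] is True when the character at sentence
    index i may appear at 1-based `position` in a word of `length`.
    Index 0 of the inner dimensions is unused.
    """
    cache = [
        [[False] * (MAX_LEN + 1) for _ in range(MAX_LEN + 1)]
        for _ in sentence
    ]
    for i, ch in enumerate(sentence):
        for position, length in char_pos.get(ch, ()):
            if length <= MAX_LEN:
                cache[i][position][length] = True
    return cache


# Toy CharPos table with English letters in place of Chinese characters.
toy_char_pos = {"b": {(1, 2)}, "e": {(2, 2)}}
cache = build_charpos_cache("be", toy_char_pos)
```

A lookup such as cache[0][1][2] then answers "may the first character of the sentence begin a two-character word?" without consulting the lexical knowledge base again.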

In step 601, the facility sets this position at the first character of the
sentence. In steps 602-614, the facility continues to repeat steps 603-613 until the position
has advanced to the end of the sentence.

In steps 603-609, the facility loops through each word candidate that
begins at the current position. The facility preferably begins with the word candidate that
starts at the current position and is seven characters long, and, in each iteration, removes
one character from the end of the word candidate until the word candidate is two
characters long. If there are fewer than seven characters remaining in the sentence
beginning from the current position, the facility preferably omits the iterations for the
word candidates for which there are insufficient characters remaining in the sentence. In
step 604, the facility tests for the current word candidate conditions relating to the
NextChars and CharPos attributes of the characters comprising the word candidate. Step
604 is discussed in greater detail below in conjunction with Figure 7. If both the
NextChars and CharPos conditions are satisfied for the word candidate, then the facility
continues in step 605, else the facility continues in step 609. In step 605, the facility
looks up the word candidate in the lexical knowledge base to determine whether the word
candidate is a word. In step 606, if the word candidate is a word, then the facility
continues in step 607, else the facility continues in step 609. In step 607, the facility adds
the word candidate to the list of words occurring in the sentence. In step 608, if the word
candidate may contain other words, i.e., if the IgnoreParts attribute for the word is clear,
then the facility continues in step 609, else the facility continues in step 611. In step 609,
if additional word candidates remain to be processed, then the facility continues in step 603
to process the next word candidate, else the facility continues in step 610. In step 610,
the facility advances the current position one character toward the end of the sentence.
After step 610, the facility continues in step 614.

In step 611, if the last character of the word candidate overlaps with
another word candidate that may also be a word, then the facility continues in step 613,
else the facility continues in step 612. Step 611 is discussed in greater detail below in
conjunction with Figure 8. In step 612, the facility advances the position to the character
in the sentence after the last character of the word candidate. After step 612, the facility
continues in step 614. In step 613, the facility advances the position to the last character
of the current word candidate. After step 613, the facility continues in step 614. In step
614, if the position is not at the end of the sentence, then the facility continues in step 602
to consider a new group of word candidates, else these steps conclude.
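
The control flow of steps 601-614 can be sketched as follows. This is an illustrative reading of the flow, not the facility's code: the function name is hypothetical, and the four callables (passes_tests for the test of step 604, is_word for the lookup of steps 605-606, ignore_parts for step 608, and overlaps for step 611) are assumed to be supplied by the caller.

```python
def find_multichar_words(sentence, passes_tests, is_word, ignore_parts,
                         overlaps, max_len=7):
    """Collect (start, word) pairs for multiple-character words."""
    found = []
    n = len(sentence)
    pos = 0                                   # step 601
    while pos < n:                            # steps 602 and 614
        next_pos = pos + 1                    # step 610 (default advance)
        # Step 603: longest candidate first, down to two characters,
        # skipping lengths that would run past the end of the sentence.
        for length in range(min(max_len, n - pos), 1, -1):
            candidate = sentence[pos:pos + length]
            if not passes_tests(candidate):   # step 604
                continue
            if not is_word(candidate):        # steps 605-606
                continue
            found.append((pos, candidate))    # step 607
            if ignore_parts(candidate):       # step 608: attribute is set
                if overlaps(pos, length):     # step 611
                    next_pos = pos + length - 1   # step 613
                else:
                    next_pos = pos + length       # step 612
                break
        pos = next_pos
    return found


# Toy run with English letters standing in for Chinese characters.
lexicon = {"ab", "abc", "cd"}
words = find_multichar_words(
    "abcd",
    passes_tests=lambda c: True,       # pretend every candidate passes
    is_word=lambda c: c in lexicon,
    ignore_parts=lambda w: w != "abc", # "abc" may contain smaller words
    overlaps=lambda pos, length: False,
)
```

In the toy run, "abc" is recorded but, because its IgnoreParts attribute is treated as clear, the shorter candidate "ab" is still examined; "ab" then causes the position to jump past itself, where "cd" is found.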

Figure 7 is a flow diagram showing the steps preferably performed by the
facility in order to test the NextChars and CharPos conditions for a word candidate. In
step 701, if the second character of the word candidate is in the NextChars list of the first
character of the word candidate, then the facility continues in step 703, else the facility
continues in step 702 to return the result that the conditions are not both satisfied. In steps
703-706 the facility loops through each character position in the word candidate. In step
704, if the ordered pair made up of the current position and the length of the word
candidate is among the ordered pairs in the CharPos list for the character in the current
character position, then the facility continues in step 706, else the facility continues in
step 705 to return the result that the conditions are not both satisfied. In step 706, if
additional character positions remain in the word candidate to be processed, then the
facility continues in step 703 to process the next character position in the word candidate,
else the facility continues in step 707 to return the result that both conditions are satisfied
by the word candidate.
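
The two tests of Figure 7 can be sketched as a single predicate. The function name and the dictionary-of-sets representation of the CharPos and NextChars attributes are assumptions for illustration, and the candidate is assumed to be at least two characters long, as in the steps of Figure 6.

```python
def passes_charpos_and_nextchars(candidate, char_pos, next_chars):
    """Return True only if both the NextChars and CharPos tests pass."""
    # Step 701: the second character must be a known follower of the first.
    if candidate[1] not in next_chars.get(candidate[0], ()):
        return False
    # Steps 703-706: every character must be known to occur at its
    # position in some word of the candidate's length.
    length = len(candidate)
    return all(
        (position, length) in char_pos.get(ch, ())
        for position, ch in enumerate(candidate, start=1)
    )


# Toy tables derived from the words "be", "at", and "beat".
toy_char_pos = {
    "b": {(1, 2), (1, 4)},
    "e": {(2, 2), (2, 4)},
    "a": {(1, 2), (3, 4)},
    "t": {(2, 2), (4, 4)},
}
toy_next_chars = {"b": {"e"}, "a": {"t"}}
ok_beat = passes_charpos_and_nextchars("beat", toy_char_pos, toy_next_chars)
ok_bea = passes_charpos_and_nextchars("bea", toy_char_pos, toy_next_chars)
ok_ea = passes_charpos_and_nextchars("ea", toy_char_pos, toy_next_chars)
```

Here "bea" passes the NextChars test but fails the CharPos test, because no word of length 3 is recorded for "b," so it would never be looked up in the lexical knowledge base.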

Figure 8 is a flow diagram showing the steps preferably performed by the
facility in order to determine whether the last character of the current word candidate
overlaps with another word candidate that may be a word. In step 801, if the character
after the word candidate is in the list of characters in the NextChars attribute for the last
character of the word candidate, then the facility continues in step 803, else the facility
continues in step 802 to return the result that there is no overlap. In step 803, the facility
looks up in the lexical knowledge base the word candidate without its last character in
order to determine whether the word candidate without its last character is a word. In
step 804, if the word candidate without its last character is a word, then the facility
continues in step 806 to return the result that there is overlap, else the facility continues in
step 805 to return the result that there is no overlap.
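
The overlap test of Figure 8 can be sketched as follows. The function name, the 0-based indexing, and the set-based lexicon are assumptions for illustration. The handling of a candidate that ends the sentence (reporting no overlap) is also an assumption, since the description does not address that case.

```python
def last_char_overlaps(sentence, start, length, next_chars, lexicon):
    """Return True if the candidate's last character may begin another word.

    `start` is the 0-based index of the candidate and `length` its size.
    """
    end = start + length            # index of the character after the candidate
    if end >= len(sentence):
        return False                # assumption: nothing follows, so no overlap
    last = sentence[end - 1]
    # Step 801: the following character must be a known follower of the
    # candidate's last character.
    if sentence[end] not in next_chars.get(last, ()):
        return False
    # Steps 803-804: the candidate without its last character must be a word.
    return sentence[start:end - 1] in lexicon


# Toy data with English letters standing in for Chinese characters.
toy_next = {"c": {"d"}}
overlap_yes = last_char_overlaps("abcd", 0, 3, toy_next, {"ab", "abc", "cd"})
overlap_no_prefix = last_char_overlaps("abcd", 0, 3, toy_next, {"abc", "cd"})
overlap_no_follower = last_char_overlaps("abcd", 0, 3, {}, {"ab", "abc"})
```

In the first call the candidate "abc" could instead be read as "ab" followed by a word beginning "cd", so overlap is reported and the position would advance only to the last character of the candidate.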

The performance of the steps shown in Figure 6 with respect to the
example is shown below in Table 7.

Number  Combination  CharPos      NextChars  Look up?  Is a word?
------  -----------  -----------  ---------  --------  ----------
1       […]          fail on […]  pass       no        no
2       […]          fail on […]  pass       no        no
3       […]          fail on […]  pass       no        no
4       […]          fail on […]  pass       no        no
5       […]          pass         pass       yes       no
6       […]          pass         pass       yes       yes
7       […]          fail on […]  pass       no        no
8       […]          fail on […]  pass       no        no
9       […]          fail on […]  pass       no        no
10      […]          fail on […]  pass       no        no
11      […]          fail on […]  pass       no        no
12      […]          pass         pass       yes       yes
13      […]          fail on […]  pass       no        no
14      […]          fail on […]  pass       no        no
15      […]          fail on […]  pass       no        no
16      […]          pass         pass       yes       no
17      […]          pass         pass       yes       no
18      […]          pass         pass       yes       yes
19      […]          fail on […]  pass       no        no
20      […]          fail on […]  pass       no        no
21      […]          fail on […]  pass       no        no
22      […]          fail on […]  pass       no        no
23      […]          pass         pass       yes       yes
24      […]          pass         pass       yes       yes
25      […]          fail on […]  fail       no        no
26      […]          fail on […]  fail       no        no
27      […]          fail on […]  fail       no        no
28      […]          pass         fail       no        no
29      […]          pass         fail       no        no
30      […]          pass         fail       no        no
31      […]          fail on […]  fail       no        no
32      […]          fail on […]  fail       no        no
33      […]          fail on […]  fail       no        no
34      […]          pass         fail       no        no
35      […]          pass         fail       no        no
36      […]          pass         fail       no        no
37      […]          pass         pass       yes       yes
38      […]          fail on […]  pass       no        no
39      […]          fail on […]  pass       no        no
40      […]          fail on […]  pass       no        no
41      […]          fail on […]  pass       no        no
42      […]          pass         pass       yes       no
43      […]          pass         pass       yes       yes
44      […]          fail on […]  fail       no        no
45      […]          fail on […]  fail       no        no
46      […]          fail on […]  fail       no        no
47      […]          fail on […]  fail       no        no
48      […]          pass         pass       yes       no
49      […]          fail on […]  pass       no        no
50      […]          pass         pass       yes       yes
51      […]          pass         fail       no        no
52      […]          pass         fail       no        no
53      […]          pass         pass       yes       yes

Table 7: Character Combinations Considered



Table 7 indicates, for each of the 53 combinations of characters from the sample sentence
considered by the facility: the result of the CharPos test, the result of the NextChars test,
whether the facility looked up the word in the lexical knowledge base, and whether the
lexical knowledge base indicated that the combination of characters is a word.

It can be seen that combinations 1-4 failed the CharPos test because the
CharPos attribute of the character "[illegible]" does not contain the ordered pairs (1, 7),
(1, 6), (1, 5), or (1, 4). For combinations 5 and 6, on the other hand, both the CharPos and
NextChars tests are passed. The facility therefore looks up combinations 5 and 6 in the
lexical knowledge base, to determine that combination 5 is not a word, but combination 6
is a word. After processing combination 6, and determining how far to advance the
current position, the facility determines that the IgnoreParts attribute is set, but that the
word "[illegible]" overlaps with a word candidate beginning with the character "[illegible]".
The facility therefore advances to the character "[illegible]" at the end of combination 6 in
accordance with step 613. In combinations 7-12, only combination 12 passes the
CharPos and NextChars tests. Combination 12 is therefore looked up and determined to
be a word. After processing combination 12, and determining how far to advance the
current position, the facility determines that the IgnoreParts attribute of the word
constituted by combination 12 is clear, and therefore advances the current position one
character to the character "[illegible]" rather than to the character following combination 12.
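
The two screening tests applied to each combination before any lookup can be sketched as
follows. This is a minimal illustration in Python rather than the facility's actual
implementation. The names `char_pos`, `next_chars`, and the three functions are hypothetical,
the Latin letters stand in for Chinese characters, and the NextChars test is assumed to check
only the second character of the combination against the list kept for its first character.

```python
def char_pos_test(candidate, char_pos):
    """Return None on a pass, or the first character the test fails on.

    char_pos maps a character to the set of (position, length) pairs in
    which it is known to occur, with positions counted from 1.
    """
    length = len(candidate)
    for position, ch in enumerate(candidate, start=1):
        if (position, length) not in char_pos.get(ch, set()):
            return ch
    return None


def next_chars_test(candidate, next_chars):
    """Pass if the second character may follow the first at the start of a word."""
    return candidate[1] in next_chars.get(candidate[0], set())


def should_look_up(candidate, char_pos, next_chars):
    """A combination is looked up in the lexicon only when both tests pass."""
    return (char_pos_test(candidate, char_pos) is None
            and next_chars_test(candidate, next_chars))


# Hypothetical data in which "AB" and "BC" are the only known words.
char_pos = {"A": {(1, 2)}, "B": {(1, 2), (2, 2)}, "C": {(2, 2)}}
next_chars = {"A": {"B"}, "B": {"C"}}
```

With this data, "AB" passes both tests, "ABC" fails the CharPos test on "A", and "AC" passes
the CharPos test but fails the NextChars test, so only "AB" would be looked up.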

It can further be seen that combinations 18, 24, 37, and 43 are words that
have their IgnoreParts attribute set and do not overlap in their final characters with any
word candidates that may be words. After processing each, therefore, the facility
advances the current position to the character following the character combination in
accordance with step 612, thereby omitting to process unnecessarily up to 41 additional
combinations for each of these four combinations.

It can further be seen that the IgnoreParts attributes of the words 
constituted by combinations 23 and 50 are clear. For this reason, the facility advances the 
current position only one character in accordance with step 610 after processing these 
combinations. 

It can further be seen that the two-character combinations 30, 36, 47, and
52 are not determined by the facility to constitute words. The facility therefore advances
the current position only one character after processing these combinations in accordance
with step 610. In all, the facility looks up only 14 of 112 possible combinations in the
sample sentence. Of the 14 combinations looked up by the facility, nine are in fact real
words.
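
The three ways of advancing the current position described above (steps 610, 612, and 613)
can be summarized in a short Python sketch. The function name and its parameters are
hypothetical, positions are counted from 1, and the function is assumed to be called once the
combinations beginning at `start` have been processed.

```python
def next_position(start, length, is_word, ignore_parts, overlaps_next_candidate):
    """Return the position at which the next combinations begin.

    start:  position of the first character of the combination just processed
    length: number of characters in that combination
    """
    if not is_word or not ignore_parts:
        return start + 1            # step 610: advance one character
    if overlaps_next_candidate:
        return start + length - 1   # step 613: final character of the word
    return start + length           # step 612: character following the word
```

For combination 6, which occupies positions 1-2, has its IgnoreParts attribute set, and
overlaps a following candidate, the sketch returns position 2. For combination 12, which
occupies positions 2-3 and has its IgnoreParts attribute clear, it returns position 3.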

As shown below in Table 8, after the processing described in conjunction
with Table 7, the word list contains the words constituted by combinations 6, 12, 18, 23,
24, 37, 43, 50, and 53.



Number  Word         part of speech
6       [illegible]  noun
12      [illegible]  noun
18      [illegible]  noun
24      [illegible]  noun
23      [illegible]  noun
37      [illegible]  noun
43      [illegible]  verb
50      [illegible]  pronoun
53      [illegible]  noun



Table 8: Word List with Multiple-Character Words 



Figure 9 is a flow diagram showing the steps preferably performed by the
facility in order to add single-character words to the word list. In steps 901-906, the
facility loops through each character in the sentence, from the first character to the last
character. In step 902, the facility determines, based on its entry in the lexical knowledge
base, whether the character comprises a single-character word. If the character comprises a
single-character word, then the facility continues in step 903, else the facility continues in
step 906 without adding the character to the word list. In step 903, if the character is
contained in a word that may not contain other words, i.e., a word already on the word list
that has its IgnoreParts attribute set, then the facility continues in step 904, else the facility
continues in step 905 to add the character to the word list. In step 904, if the character is
contained in a word on the word list that overlaps with another word on the word list,
then the facility continues in step 906 without adding the character to the word list, else
the facility continues in step 905. In step 905, the facility adds the single-character word
comprising the current character to the word list. In step 906, if additional characters
remain in the sentence to be processed, then the facility continues in step 901 to process
the next character in the sentence, else these steps conclude.

Table 9 below shows that, in performing the steps shown in Figure 9, the
facility adds single-character words 54-61 to the word list.



Number  Word         part of speech
6       [illegible]  noun
54      [illegible]  morpheme
55      [illegible]  noun
12      [illegible]  noun
56      [illegible]  noun (localizer)
18      [illegible]  noun
24      [illegible]  noun
23      [illegible]  noun
57      [illegible]  noun
57      [illegible]  verb
58      [illegible]  verb
58      [illegible]  preposition
58      [illegible]  adverb
37      [illegible]  noun
43      [illegible]  verb
59      [illegible]  function word
50      [illegible]  pronoun
60      [illegible]  pronoun
61      [illegible]  noun (classifier)
53      [illegible]  noun



Table 9: Word List with Single- and Multiple-Character Words 



It should be understood that adding multiple-character words to the word
list, and then adding single-character words to the word list is but one exemplary method
of creating the word list. In an alternative approach, the word list can be obtained by first
locating the single-character words and then adding to the word list multiple-character
words. With respect to locating first the single-character words, the approach is similar to
the approach described above and illustrated in Figure 9; however, steps 903 and 904 are
omitted. Specifically, in step 902, the facility determines, based on its entry in the lexical
knowledge base, whether the character comprises a single-character word. If the character
comprises a single-character word, then the facility continues in step 905 to add the
character to the word list, else the facility continues in step 906 without adding the
character to the word list. The facility processes each character in the sentence to
determine if the character is a word by looping through steps 901, 902, 905 and 906.

In the alternative approach, the facility then processes the sentence to
locate multiple-character words, and to add such words to the word list. The facility can
use the method described above with respect to Figure 6. However, since the sentence
may contain multiple-character words that cannot contain other words, i.e., if the
IgnoreParts attribute for the multiple-character word is set, then it is beneficial to delete
or remove from the word list those single-character words that make up the
multiple-character word. Removal of these single-character words from the word list minimizes
the analysis required of the parser 133.

The removal of single-character words from the word list is complicated,
however, if two multiple-character words, having their IgnoreParts attributes set, overlap.
A generic example will be instructive. Suppose a character sequence ABC is present in
the sentence under consideration and that the sequence can comprise multiple-character
words AB and BC that have their IgnoreParts attribute set. Suppose also that A, B and C
are single-character words. There will be a problem if all the single-character words
covered by words AB and BC are simply removed from the word list. Specifically, the
word A will be missed if BC is the correct word in the sentence. Likewise, the word C
will be missed if the word AB is the correct word in the sentence. In either case, the
sentence will not be parsed, because none of the "paths" through the sentence is
unbroken. To prevent this from happening, all the single-character words in a
multiple-character word will be retained regardless of the value of the IgnoreParts attribute except
for the word(s) covered by the overlapping part. In the generic example described above,
both words A and C will be retained in the word list; however, B will be removed from
the word list since it is the overlapping portion of the sequence. Referring to Figure 8, if
the facility reaches step 802 in the alternative approach, all of the single-character words
making up the word candidate would be removed from the list. If the facility, instead,
reaches step 806, the non-overlapping single-character words will be retained, while the
overlapping portion(s) will be removed.
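
The retention rule for the alternative approach can be sketched in Python as follows. The
function name and the representation of words as inclusive (start, end) character positions
are assumptions made for illustration, and the sketch considers overlaps only among
multiple-character words whose IgnoreParts attribute is set, which is the case the generic
ABC example describes.

```python
def single_characters_to_remove(ignore_parts_words):
    """Return the positions whose single-character words should be dropped.

    ignore_parts_words: list of (start, end) spans, inclusive, of the
    multiple-character words whose IgnoreParts attribute is set.
    """
    to_remove = set()
    for i, (start, end) in enumerate(ignore_parts_words):
        covered = set(range(start, end + 1))
        overlap = set()
        for j, (o_start, o_end) in enumerate(ignore_parts_words):
            if i != j:
                overlap |= covered & set(range(o_start, o_end + 1))
        # No overlap: every character of the word is dropped (step 802).
        # Overlap: only the shared characters are dropped (step 806).
        to_remove |= overlap if overlap else covered
    return to_remove
```

For the sequence ABC at positions 1-3, with AB at (1, 2) and BC at (2, 3), only position 2
(the word B) is removed, so A and C remain available to the parser.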

In the method described above, possible overlapping words are located by 
examining the NextChar list for the last character in a word candidate (step 801), and 
ascertaining if a word candidate without its last character is a word (step 804). In an 
alternative approach, overlapping words can be found by examining other information 
that is provided to the parser 133 along with the word list. Specifically, in addition to the 
word list, the parser 133 receives positional information of each word in the word list. 
In the example of Table 3, the characters are numbered sequentially from 1 to
22. Using this positional information, a starting position of the word and an ending
position of the word are determined for each word in the word list. Referring to the words
identified in Table 9 by way of example, the word denoted by number "6" would have a
starting character position of 1 and an ending character position of 2, while the word 
denoted by number "12" would have a starting character position of 2 and an ending 
character position of 3. Single-character words would have a starting character position 
equal to an ending character position. Overlapping words can then be easily ascertained 
by examining the ending character position and the starting character position of possible 
adjacent words in the sentence. Specifically, if the ending character position of a possible 
word in the sentence is greater than or equal to the starting character position of the next 
possible word in the sentence, an overlap condition exists. 
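
The positional overlap test described above reduces to a single comparison. The following
Python sketch is illustrative only: the function name is hypothetical, and each word is
represented as an assumed (start, end) pair of character positions counted from 1.

```python
def overlaps(word, next_word):
    """Return True if word overlaps the next possible word in the sentence.

    Each argument is a (start, end) pair of character positions, and
    word is assumed to start no later than next_word.
    """
    return word[1] >= next_word[0]


word_6 = (1, 2)    # positions given above for the word denoted by number "6"
word_12 = (2, 3)   # positions given above for the word denoted by number "12"
```

Here `overlaps(word_6, word_12)` is true, because the ending position of word 6 equals the
starting position of word 12.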

After adding multiple- and single-character words to the word list and
generating lexical records for those words, the facility assigns probabilities to the lexical
records that are used by the parser to order its application of the lexical records in the
parsing process. Figures 10 and 11, discussed below, show two alternative approaches
used by the facility in order to assign probabilities to the lexical records.

Figure 10 is a flow diagram showing the steps preferably performed by the
facility in order to assign probabilities to the lexical records generated from the words in
the word list in accordance with a first approach. The facility preferably ultimately sets
the probability for each lexical record to either a high probability value that will cause the
parser to consider the lexical record early during the parsing process, or to a low
probability value, which will cause the parser to consider the lexical record later in the
parsing process. In steps 1001-1005, the facility loops through each word in the word list.
In step 1002, if the current word is contained in a larger word in the word list, then the
facility continues in step 1004, else the facility continues in step 1003. In step 1003, the
facility sets the probability for the lexical record representing the word to the high
probability value. After step 1003, the facility continues in step 1005. In step 1004, the
facility sets the probability for the lexical records representing the word to the low
probability value. After step 1004, the facility continues in step 1005. In step 1005, if
additional words remain in the word list to be processed, then the facility continues in
step 1001 to process the next word in the word list, else these steps conclude.
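
The first approach can be sketched in Python as follows. This is an illustration under stated
assumptions, not the facility's implementation: each word is represented only by a
hypothetical (start, end) pair of inclusive character positions, and "contained in a larger
word" is taken to mean that another word's span covers it and is strictly longer.

```python
HIGH, LOW = "high", "low"


def assign_probabilities(spans):
    """Return one probability value per word, in word-list order.

    spans: list of (start, end) inclusive positions, one per word.
    """
    probabilities = []
    for i, (start, end) in enumerate(spans):            # steps 1001-1005
        contained = any(                                # step 1002
            o_start <= start and end <= o_end
            and (o_end - o_start) > (end - start)
            for j, (o_start, o_end) in enumerate(spans)
            if j != i
        )
        probabilities.append(LOW if contained else HIGH)  # step 1004 / step 1003
    return probabilities
```

Because a word receives the low value only when a longer word covers it, every character
remains covered by at least one word that carries the high value.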

Table 10 below shows the probability values assigned to each word in the
word list in accordance with steps shown in Figure 10. It can be seen by reviewing the 
probabilities that the facility assigns the high probability value to at least one word 
containing each character, so that at least one lexical record containing each character is 
considered early in the parsing process. 



Number  Word         part of speech     probability value
6       [illegible]  noun               high
54      [illegible]  morpheme           low
55      [illegible]  noun               low
12      [illegible]  noun               high
56      [illegible]  noun (localizer)   low
18      [illegible]  noun               high
24      [illegible]  noun               low
23      [illegible]  noun               high
57      [illegible]  noun               low
57      [illegible]  verb               low
58      [illegible]  verb               high
58      [illegible]  preposition        high
58      [illegible]  adverb             high
37      [illegible]  noun               high
43      [illegible]  verb               high
59      [illegible]  function word      high
50      [illegible]  pronoun            high
60      [illegible]  pronoun            low
61      [illegible]  noun (classifier)  low
53      [illegible]  noun               high



Table 10: Word List with Probabilities 



Figure 11 is a flow diagram showing the steps preferably performed by the
facility in order to assign probabilities to the lexical records generated from the words in
the word list in accordance with a second approach. In step 1101, the facility uses the
word list to identify all the possible segmentations of the sentence comprised entirely of
the words in the word list. In step 1102, the facility selects the one or more possible
segmentations identified in step 1101 that contain the fewest words. If more than one of
the possible segmentations has the minimum number of words, the facility selects each
such possible segmentation. Table 11 below shows the possible segmentation generated
from the word list shown in Table 9 having the fewest words (9).

Table 11

In step 1103, the facility sets the probability for the lexical records of the words in the
selected segmentation(s) to the high probability value. In step 1104, the facility sets the
probability for the lexical records of the words not in the selected segmentation(s) to the low
probability value. After step 1104, these steps conclude.
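
Steps 1101-1104 can be sketched in Python as follows. The sketch assumes that each word is
represented by a hypothetical (start, end) pair of inclusive character positions counted from
1. Rather than listing every segmentation first, it uses memoized recursion to return only
the segmentations with the fewest words, which yields the same selection as steps 1101 and
1102. All function names are illustrative.

```python
from functools import lru_cache


def fewest_word_segmentations(spans, length):
    """Return every segmentation of positions 1..length that uses the fewest words.

    Each segmentation is a tuple of indices into spans.
    """
    by_start = {}
    for index, (start, _) in enumerate(spans):
        by_start.setdefault(start, []).append(index)

    @lru_cache(maxsize=None)
    def best(position):
        if position > length:
            return 0, ((),)
        best_count, best_paths = None, ()
        for index in by_start.get(position, []):
            count, paths = best(spans[index][1] + 1)
            if count is None:
                continue  # dead end: no word continues from that position
            extended = tuple((index,) + path for path in paths)
            if best_count is None or count + 1 < best_count:
                best_count, best_paths = count + 1, extended
            elif count + 1 == best_count:
                best_paths += extended
        return best_count, best_paths

    return best(1)[1]


def assign_by_fewest_words(spans, length):
    """High for words in a selected segmentation (step 1103), low otherwise (step 1104)."""
    selected = {i for path in fewest_word_segmentations(spans, length) for i in path}
    return ["high" if i in selected else "low" for i in range(len(spans))]
```

For a generic sequence ABC with words A, B, C, AB, and BC, the two-word segmentations AB + C
and A + BC are both selected, so only B receives the low value.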
Table 12 below shows the probability values assigned to each word in the
word list in accordance with steps shown in Figure 11. It can be seen by reviewing the
probabilities that the facility assigns the high probability value to at least one word
containing each character, so that at least one lexical record containing each character is
considered early in the parsing process.



Number  Word         part of speech     probability value
6       [illegible]  noun               high
54      [illegible]  morpheme           low
55      [illegible]  noun               low
12      [illegible]  noun               low
56      [illegible]  noun (localizer)   low
18      [illegible]  noun               high
24      [illegible]  noun               low
23      [illegible]  noun               high
57      [illegible]  noun               low
57      [illegible]  verb               low
58      [illegible]  verb               high
58      [illegible]  preposition        high
58      [illegible]  adverb             high
37      [illegible]  noun               high
43      [illegible]  verb               high
59      [illegible]  function word      high
50      [illegible]  pronoun            high
60      [illegible]  pronoun            low
61      [illegible]  noun (classifier)  low
53      [illegible]  noun               high



Table 12: Word List with Probabilities 






In one broad aspect of the present invention, probabilities can also be 
assigned to overlapping pairs of words. In the generic character sequence ABC, statistical
data may indicate that the probability of the combination of words AB and C is higher 
than the combination of A and BC. Thus, the parser 133 should consider the combination 
AB and C first, whereas the combination of A and BC should not be considered unless no 
successful analysis can be found using AB and C. Statistical data may also indicate that 
one of the possible combinations AB and C, or A and BC is impossible. 

In order to assign relative probabilities to a word in an overlapping pair of
words, or remove impossible combinations, information is stored in the lexical
knowledge base 132. In particular, additional lists can be associated with many
multiple-character words in the lexical knowledge base 132. The lists include:

(1) a first left condition list - the word in this entry would be assigned a
low probability if it is immediately preceded by one of the characters
in this list in the sentence;

(2) a first right condition list - the word in this entry would be assigned a
low probability if it is immediately followed by one of the characters in
this list in a sentence;

(3) a second left condition list - the word in this entry would be ignored if
it is immediately preceded by one of the characters in this list in a
sentence. In other words, if a multiple-character word in the word list
meets this condition, it will be removed from the word list; and

(4) a second right condition list - the word in this entry would be ignored
if it is immediately followed by one of the characters in this list in a
sentence. In other words, if the word in the word list meets this
condition, it will be removed from the word list.

It should be noted that each of the foregoing lists may not be present for
every multiple-character word in the lexical knowledge base 132. In other words, some of
the multiple-character words in the lexical knowledge base 132 may not have any of the
foregoing lists, while others will have one, some or all of the lists. If desired, other lists
can be generated based on immediately preceding or following characters. For instance,
lists can be generated to assign high probabilities. The lists are entered in the lexical
knowledge base 132 manually.
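
One way to represent the four condition lists and apply them to a word in context is sketched
below in Python. The class and field names are hypothetical, the word is located by 0-based
slice bounds, and the sketch assumes that a match on either "ignore" list takes precedence
over a match on either "low probability" list, an ordering the description above does not
specify.

```python
from dataclasses import dataclass, field


@dataclass
class ConditionLists:
    low_if_preceded_by: set = field(default_factory=set)     # list (1)
    low_if_followed_by: set = field(default_factory=set)     # list (2)
    ignore_if_preceded_by: set = field(default_factory=set)  # list (3)
    ignore_if_followed_by: set = field(default_factory=set)  # list (4)


def apply_condition_lists(sentence, start, end, lists):
    """Return "ignore", "low", or None for the word sentence[start:end].

    None means that no condition list applies to the word in this context.
    """
    before = sentence[start - 1] if start > 0 else None
    after = sentence[end] if end < len(sentence) else None
    if before in lists.ignore_if_preceded_by or after in lists.ignore_if_followed_by:
        return "ignore"   # the word is removed from the word list
    if before in lists.low_if_preceded_by or after in lists.low_if_followed_by:
        return "low"      # the word is assigned a low probability
    return None
```

In the generic sequence ABC, giving the word BC a first left condition list containing A
makes the parser consider the combination AB and C before the combination A and BC.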

In addition to analysis using a lexical knowledge base to resolve
ambiguity as discussed above, a rule-based disambiguation analysis can also be used
in combination with the lexical analysis before parsing begins. For example, if a character
string ABCD is present in a sentence where AB, BC and CD are all possible words, word
BC can be ignored (removed from the word list) if AB does not overlap with a preceding
word, CD does not overlap with a following word, either A or D is a non-word, and
neither ABC nor BCD is a word.
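
The ABCD rule can be expressed as a single predicate. The Python sketch below is an
illustration under stated assumptions: the function name is hypothetical, the word list is
represented as a set of inclusive (start, end) character positions, and a word is taken to
"overlap" AB or CD when it extends beyond that word on the outer side while sharing at least
one character with it.

```python
def can_ignore_middle_word(words, b):
    """Apply the ABCD rule to the two-character word BC that starts at position b.

    words: set of (start, end) inclusive spans for every word in the word list.
    Returns True when BC may be removed from the word list.
    """
    a, c, d = b - 1, b + 1, b + 2
    ab, bc, cd = (a, b), (b, c), (c, d)
    if not {ab, bc, cd} <= words:
        return False                                   # AB, BC and CD must all be possible words
    preceded = any(s < a <= e for s, e in words)       # a preceding word overlaps AB
    followed = any(s <= d < e for s, e in words)       # a following word overlaps CD
    edge_non_word = (a, a) not in words or (d, d) not in words  # A or D is a non-word
    longer_word = (a, c) in words or (b, d) in words   # ABC or BCD is a word
    return not preceded and not followed and edge_non_word and not longer_word
```

When the predicate holds, the only unbroken paths through ABCD that use BC would require
both A and D as single-character words, so dropping BC loses no viable segmentation.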

It should be emphasized, however, that there is no logical dependency
between the parser's ability to resolve segmentation ambiguities and the lexical
disambiguation described above. The elimination of words at the lexical level reduces
parsing complexity, but is not always a necessary condition for the successful analysis of
a sentence. Parsing will be successful as long as all of the correct words in a sentence are
in the word list provided by the facility 131, and the number of words in the word list is
not so great as to overburden the parser 133. Therefore, the success of sentence analysis,
including correct word segmentation, does not depend on the complete success of lexical
disambiguation, though the latter will greatly facilitate the former. This allows
development of the facility 131 and the parser 133 independently despite the fact that
there is interaction between the components.

Figure 12 is a parse tree diagram showing a parse tree generated by the
parser representing the syntactic structure of the sample sentence. It can be seen that the
parse tree is a hierarchical structure having a single sentence record 1231 as its head and
having a number of lexical records 1201-1211 as its leaves. The parse tree further has
intermediate syntactic records 1221-1227 that combine lexical records each representing a
word into a larger syntactic structure representing one or more words. For example, the
prepositional phrase record 1223 combines a lexical record 1204 representing a
preposition and a lexical record 1206 representing a noun. In accordance with step 506 of
Figure 5, the facility identifies the words represented by lexical records 1201-1211 in the
parse tree as the words into which the sample sentence should be segmented. This parse
tree may also be retained by the facility in order to perform additional natural language
processing on the sentence.

While this invention has been shown and described with reference to
preferred embodiments, it will be understood by those skilled in the art that various
changes or modifications in form and detail may be made without departing from the
scope of the invention. For example, aspects of the facility may be applied to perform
word segmentation in languages other than Chinese. Further, proper subsets or supersets
of the techniques described herein may be applied to perform word segmentation.

CLAIMS

We claim:

1. A method in a computer system for identifying individual words occurring in a
sentence of text, the method comprising the steps of:

for each of a plurality of words:

storing an indication of probability of whether the word occurs in
natural language text as a function of adjacent characters;

for each of a plurality of contiguous groups of characters occurring in the sentence:

determining overlapping possible words;

ascertaining probability based on the stored indication and adjacent
characters; and

submitting the groups of characters determined to be possible words to a
parser with an indication of probability.

2. The method of claim 1 wherein for each of a plurality of words having an indication 
of probability, the data structure further comprises an associated list of characters. 

3. The method of claim 1 wherein an indication of probability is low if the word is
preceded by one of the characters in a list.

4. The method of claim 1 wherein an indication of probability is low if the word is 
followed by one of the characters in a list. 

5. The method of claim 1 wherein an indication of probability is zero if the word is 
preceded by one of the characters in a list. 

6. The method of claim 1 wherein an indication of probability is zero if the word is
followed by one of the characters in a list.

7. The method of claim 1 wherein the natural language is Chinese.



8. A computer-readable medium storing instructions for a computer system for
identifying individual words occurring in a sentence of text, the instructions comprising the
steps of:

for each of a plurality of words:

storing an indication of probability of whether the word occurs in
natural language text as a function of adjacent characters;

for each of a plurality of contiguous groups of characters occurring in the sentence:

determining overlapping possible words;

ascertaining probability based on the stored indication and adjacent
characters; and

submitting the groups of characters determined to be possible words to a
parser with an indication of probability.

9. The computer-readable medium of claim 8 wherein for each of a plurality of words
having an indication of probability, the data structure further comprises an associated list of
characters.

10. The computer-readable medium of claim 8 wherein an indication of probability is
low if the word is preceded by one of the characters in a list.

11. The computer-readable medium of claim 8 wherein an indication of probability is
low if the word is followed by one of the characters in a list.



12. The computer-readable medium of claim 8 wherein an indication of probability is 
zero if the word is preceded by one of the characters in a list. 

13. The computer-readable medium of claim 8 wherein an indication of probability is 
zero if the word is followed by one of the characters in a list. 

14. The computer-readable medium of claim 8 wherein the natural language is Chinese.

15. A computer memory containing a word segmentation data structure for use in 
identifying individual words occurring in natural language text, the data structure 
comprising: 

for each of a plurality of words: 

an indication of probability of whether the word occurs in natural 
language text as a function of adjacent characters. 

16. The computer memory of claim 15 wherein for each of a plurality of words having
an indication of probability, the data structure further comprises an associated list of 
characters. 

17. The computer memory of claim 15 wherein an indication of probability is low if the 
word is preceded by one of the characters in a list. 

18. The computer memory of claim 15 wherein an indication of probability is low if the
word is followed by one of the characters in a list.

19. The computer memory of claim 15 wherein an indication of probability is zero if the
word is preceded by one of the characters in a list.

20. The computer memory of claim 15 wherein an indication of probability is zero if the 
word is followed by one of the characters in a list. 

21. A computer memory containing a word segmentation data structure for use in 
identifying individual words occurring in natural language text, the data structure 
comprising: 

for each of a plurality of characters: 

an identification of characters that occur in the second position of words that 
begin with the character, and 

for words containing the character: 

an identification of the length of the word and the character position within 
the word occupied by the character; and 

for each of a plurality of words: 

an indication of whether the sequence of characters that comprises the word 
may also comprise a series of shorter words; and 

an indication of probability of whether the word occurs in natural language 
text as a function of adjacent characters. 






WO 99/62001 PCT/US99/11856


22. The computer memory of claim 21 wherein for each of a plurality of words having
an indication of probability, the data structure further comprises an associated list of
characters.